Advanced Text Analysis

SICSS-Munich, Day 4


Session 1️⃣: Logistics

Valerie Hase (LMU Munich)

github.com/valeriehase

valerie-hase.com

Who are you?

Hands up 🤚 if you…

  • have a background in social science

  • have a background in computer science

  • mostly work with R

  • mostly work with Python

  • have applied automated text analysis before yesterday

  • ever used (word) embeddings

Who am I?

  • Postdoc at LMU, Department of Communication (previously: University of Zurich, LSE)

  • Focus:

    • Text-as-Data
    • Digital Trace Data
    • Digital Journalism
    • Crisis Communication
  • More: github.com/valeriehase & valerie-hase.com

  • Shoutout 🙌 and a big thank you to today’s teaching assistant Renata Topinková. More info on her here: renatatopinkova.github.io

What will you learn today?

  • Session 2️⃣: Going beyond bag of words: An introduction
  • Session 3️⃣: Vector Space Models
  • Session 4️⃣: Embeddings

What is the focus of this class?

  • ✅ Overview of approaches & tutorials📚
  • ✅ Hands-on application
  • ✅ Focus on R
  • ❌ Underlying models & maths
  • ❌ Application in Python

Schedule

Time ⏰
09:00 - 10:30 1️⃣: Logistics, 2️⃣: Going beyond bag of words: An introduction, 3️⃣: Vector Space Models
10:30 - 10:45 Coffee break ☕
10:45 - 12:15 4️⃣: Embeddings
12:15 - 12:30 Introduction to group exercise 🤝
12:30 - 13:30 Lunch 🥗
13:30 - 15:45 Group exercise 🤝
15:45 - 16:00 Coffee break ☕
16:00 - 17:30 Talk by Stefanie Walter

Access to material

Packages

These are the packages that we will need today:

install.packages(c("quanteda", "quanteda.textstats",
                   "tidyverse", "devtools",
                   "udpipe", "rsyntax",
                   "lsa", "text2vec", 
                   "irlba", "purrr",
                   "ggplot2", "conText"))

In addition, we need to install the following package, which is not available on CRAN, directly from GitHub:

devtools::install_github("quanteda/quanteda.corpora")
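
To check that the installation worked before the hands-on parts start, something like the following sketch may help. It only tests whether each package listed above can be loaded from your library; the object names (cran_pkgs, github_pkgs, is_installed, missing_pkgs) are made up for this illustration and are not part of the course material.

```r
# Packages named on this slide (CRAN) and the one installed from GitHub
cran_pkgs <- c("quanteda", "quanteda.textstats",
               "tidyverse", "devtools",
               "udpipe", "rsyntax",
               "lsa", "text2vec",
               "irlba", "purrr",
               "ggplot2", "conText")
github_pkgs <- c("quanteda.corpora")

# TRUE for every package whose namespace can be loaded, FALSE otherwise
is_installed <- vapply(c(cran_pkgs, github_pkgs), requireNamespace,
                       logical(1), quietly = TRUE)

# Names of packages that still need to be installed (empty if none)
missing_pkgs <- names(is_installed)[!is_installed]
print(missing_pkgs)
```

If missing_pkgs comes back empty, everything is in place. Otherwise, rerun the install commands above for the packages it lists.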

Literature recommendations

Overview pieces by Grimmer et al. (2022) and Jurafsky & Martin (2023).

Image of Book Cover Text as Data

Image of Book Cover Speech and Language Processing

Any questions? 🤔

References

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Jurafsky, D., & Martin, J. H. (2023). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf